Homework 3 - LIME (dataset: Information - for uplift modeling)

Author: Paulina Tomaszewska

NOTE: There is no detailed description of the column names (an email has already been sent to the package author). The only available explanation:

A data frame with 10000 rows and 70 variables:

  • TREATMENT: equals 1 if the person received the marketing offer, and 0 if the person was in the control group
  • PURCHASE: equals 1 if the person accepted the offer, and 0 otherwise
  • UNIQUE_ID: unique identifier
  • AGE: age of the person
  • D_REGION_X: 1 if the person lives in region X, 0 otherwise (3 regions: A, B, C)
  • Other variables are from credit bureau data (e.g., N_OPEN_REV_ACTS = number of open revolving accounts)
In [26]:
import numpy as np
import pandas as pd

Load dataset

In [27]:
import pyreadr

train = pyreadr.read_r('hmc_train.Rda')['train']  
valid = pyreadr.read_r('hmc_valid.Rda')['valid']


def xy_split(data, y_name="PURCHASE"):
    return data.drop([y_name], axis=1), data[y_name]


X_train, Y_train = xy_split(train)
X_test, Y_test = xy_split(valid)

It turned out that the column order in the validation set differs from the training set, so the columns are reordered to match.

In [28]:
X_test = X_test[X_train.columns]
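A quick sanity check of this reindexing trick, sketched on toy frames (hypothetical values, not the actual data):

```python
import pandas as pd

# Toy stand-ins for X_train / X_test with mismatched column order.
X_tr = pd.DataFrame({"TREATMENT": [1.0], "AGE": [35.0]})
X_te = pd.DataFrame({"AGE": [41.0], "TREATMENT": [0.0]})

# Same reindexing as above, followed by an explicit alignment check.
X_te = X_te[X_tr.columns]
assert list(X_te.columns) == list(X_tr.columns)
print("columns aligned:", list(X_te.columns))
```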

Short EDA

In [29]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 69 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   TREATMENT                        10000 non-null  float64
 1   M_SNC_MST_RCNT_ACT_OPN           10000 non-null  float64
 2   TOT_HI_CRDT_CRDT_LMT             10000 non-null  float64
 3   RATIO_BAL_TO_HI_CRDT             10000 non-null  float64
 4   AGRGT_BAL_ALL_XCLD_MRTG          10000 non-null  float64
 5   N_OF_SATISFY_FNC_REV_ACTS        10000 non-null  float64
 6   AVG_BAL_ALL_FNC_REV_ACTS         10000 non-null  float64
 7   N_BANK_INSTLACTS                 10000 non-null  float64
 8   M_SNCOLDST_BNKINSTL_ACTOPN       10000 non-null  float64
 9   N_FNC_INSTLACTS                  10000 non-null  float64
 10  N_SATISFY_INSTL_ACTS             10000 non-null  float64
 11  M_SNC_MSTREC_INSTL_TRD_OPN       10000 non-null  float64
 12  TOT_INSTL_HI_CRDT_CRDT_LMT       10000 non-null  float64
 13  M_SNC_OLDST_MRTG_ACT_OPN         10000 non-null  float64
 14  M_SNC_MSTRCNT_MRTG_ACT_UPD       10000 non-null  float64
 15  M_SNC_MST_RCNT_MRTG_DEAL         10000 non-null  float64
 16  N30D_ORWRS_RTNG_MRTG_ACTS        10000 non-null  float64
 17  N_OF_MRTG_ACTS_DLINQ_24M         10000 non-null  float64
 18  N_SATISFY_PRSNL_FNC_ACTS         10000 non-null  float64
 19  RATIO_PRSNL_FNC_BAL2HICRDT       10000 non-null  float64
 20  TOT_OTHRFIN_HICRDT_CRDTLMT       10000 non-null  float64
 21  N_SATISFY_OIL_NATIONL_ACTS       10000 non-null  float64
 22  M_SNCOLDST_OIL_NTN_TRD_OPN       10000 non-null  float64
 23  N_BC_ACTS_OPN_IN_12M             10000 non-null  float64
 24  N_BC_ACTS_OPN_IN_24M             10000 non-null  float64
 25  AVG_BAL_ALL_PRM_BC_ACTS          10000 non-null  float64
 26  N_RETAIL_ACTS_OPN_IN_24M         10000 non-null  float64
 27  M_SNC_OLDST_RETAIL_ACT_OPN       10000 non-null  float64
 28  RATIO_RETAIL_BAL2HI_CRDT         10000 non-null  float64
 29  TOT_BAL_ALL_DPT_STORE_ACTS       10000 non-null  float64
 30  N_30D_RATINGS                    10000 non-null  float64
 31  N_120D_RATINGS                   10000 non-null  float64
 32  N_30D_AND_60D_RATINGS            10000 non-null  float64
 33  N_ACTS_WITH_MXD_3_IN_24M         10000 non-null  float64
 34  N_ACTS_WITH_MXD_4_IN_24M         10000 non-null  float64
 35  PRCNT_OF_ACTS_NEVER_DLQNT        10000 non-null  float64
 36  N_ACTS_90D_PLS_LTE_IN_6M         10000 non-null  float64
 37  TOT_NOW_LTE                      10000 non-null  float64
 38  N_DEROG_PUB_RECS                 10000 non-null  float64
 39  N_INQUIRIES                      10000 non-null  float64
 40  N_FNC_ACTS_VRFY_IN_12M           10000 non-null  float64
 41  N_OPEN_REV_ACTS                  10000 non-null  float64
 42  N_FNC_ACTS_OPN_IN_12M            10000 non-null  float64
 43  HI_RETAIL_CRDT_LMT               10000 non-null  float64
 44  N_PUB_REC_ACT_LINE_DEROGS        10000 non-null  float64
 45  M_SNC_MST_RCNT_60_DAY_RTNG       10000 non-null  float64
 46  N_DISPUTED_ACTS                  10000 non-null  float64
 47  AUTO_HI_CRDT_2_ACTUAL            10000 non-null  float64
 48  MRTG_1_MONTHLY_PAYMENT           10000 non-null  float64
 49  MRTG_2_CURRENT_BAL               10000 non-null  float64
 50  PREM_BANKCARD_CRED_LMT           10000 non-null  float64
 51  STUDENT_HI_CRED_RANGE            10000 non-null  float64
 52  AUTO_2_OPEN_DATE_YRS             10000 non-null  float64
 53  MAX_MRTG_CLOSE_DATE              10000 non-null  float64
 54  UPSCALE_OPEN_DATE_YRS            10000 non-null  float64
 55  STUDENT_OPEN_DATE_YRS            10000 non-null  float64
 56  FNC_CARD_OPEN_DATE_YRS           10000 non-null  float64
 57  AGE                              10000 non-null  float64
 58  D_DEPTCARD                       10000 non-null  float64
 59  D_REGION_A                       10000 non-null  float64
 60  D_REGION_B                       10000 non-null  float64
 61  D_REGION_C                       10000 non-null  float64
 62  D_NA_M_SNC_MST_RCNT_ACT_OPN      10000 non-null  float64
 63  D_NA_AVG_BAL_ALL_FNC_REV_ACTS    10000 non-null  float64
 64  D_NA_M_SNCOLDST_BNKINSTL_ACTOPN  10000 non-null  float64
 65  D_NA_M_SNC_OLDST_MRTG_ACT_OPN    10000 non-null  float64
 66  D_NA_M_SNC_MST_RCNT_MRTG_DEAL    10000 non-null  float64
 67  D_NA_RATIO_PRSNL_FNC_BAL2HICRDT  10000 non-null  float64
 68  UNIQUE_ID                        10000 non-null  float64
dtypes: float64(69)
memory usage: 5.3 MB

All variables are numeric. There is no missing data.

In [30]:
Y_train.sum()/len(Y_train)
Out[30]:
0.1996

The dataset is heavily imbalanced: class 1 constitutes only about 20% of observations.
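The class shares can also be read off directly with `value_counts`; a sketch on a hypothetical label vector reproducing the 0.1996 positive share:

```python
import pandas as pd

# Hypothetical labels with the same 19.96% positive share as Y_train.
Y = pd.Series([1] * 1996 + [0] * 8004)

print(Y.value_counts(normalize=True))  # per-class shares
positive_rate = Y.sum() / len(Y)
print(positive_rate)  # 0.1996
```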

Train Classifier

In [31]:
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

def train_xgb_model(X_train, Y_train, X_valid, Y_valid):
    xgmodel = XGBClassifier(max_depth=5,
                            learning_rate=0.05,
                            n_estimators=100,
                            objective='binary:logistic',
                            gamma=0.01)
    xgmodel.fit(X_train, Y_train, verbose=True)
    valid_score = xgmodel.score(X_valid, Y_valid)
    print("xgboost valid score {}".format(valid_score))
    return xgmodel
In [32]:
# Alternative hyper-parameters kept for experimentation (not used below).
def train_xgb_model1(X_train, Y_train, X_valid, Y_valid):
    xgmodel = XGBClassifier(max_depth=13,
                            learning_rate=0.15,
                            n_estimators=100,
                            objective='binary:logistic',
                            gamma=0.01)
    xgmodel.fit(X_train, Y_train, verbose=True)
    valid_score = xgmodel.score(X_valid, Y_valid)
    print("xgboost valid score {}".format(valid_score))
    return xgmodel
In [33]:
xgbmodel = train_xgb_model(X_train.values, Y_train, X_test.values, Y_test)
xgboost valid score 0.8401
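With ~80% of observations in class 0, raw accuracy is hard to interpret: a classifier that always predicts "no purchase" already scores about 0.80. A sketch with sklearn's `DummyClassifier` (on synthetic labels, since the real frames are not reproduced here) shows the floor the 0.8401 score should be compared against:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in: 10000 labels with a 20% positive rate.
y = np.array([1] * 2000 + [0] * 8000)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predicts the majority class, giving the accuracy floor.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.8
```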

LIME

In [34]:
import lime.lime_tabular
In [35]:
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=X_train.columns, 
                                                   class_names=['no purchase', 'purchase'], discretize_continuous=True)
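For context, the explainer above implements the local-surrogate idea behind LIME: sample perturbations around the instance, weight them by proximity, and fit a weighted linear model whose coefficients serve as the explanation. A minimal sketch of that idea (toy black box, not the lime package itself):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    # Hypothetical model whose output depends only on feature 0.
    return 1.0 / (1.0 + np.exp(-X[:, 0]))

x0 = np.array([0.5, -1.0])                      # instance to explain
Z = x0 + rng.normal(scale=0.5, size=(500, 2))   # perturbed neighbourhood
w = np.exp(-np.sum((Z - x0) ** 2, axis=1))      # proximity kernel weights

# Weighted linear surrogate fitted around x0 – its coefficients play the
# role of the local feature attributions LIME reports.
surrogate = Ridge(alpha=1.0).fit(Z, black_box(Z), sample_weight=w)
print(surrogate.coef_)  # feature 0 dominates, feature 1 is near zero
```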
In [36]:
# Wrap predict_proba so LIME receives a dense float array of class probabilities.
model_xgb_predict = lambda x: xgbmodel.predict_proba(x).astype(float)
In [37]:
def explain_observation(index, model_predict):
    exp = explainer.explain_instance(X_test.values[index], model_predict, num_features=5)
    exp.show_in_notebook(show_table=True, show_all=False)
    return exp.as_list()

OBSERVATION #2

In [38]:
explain_observation(2, model_xgb_predict)
Out[38]:
[('N_OPEN_REV_ACTS <= 0.00', -0.10742646125447863),
 ('MRTG_1_MONTHLY_PAYMENT <= 0.00', 0.0801229294964264),
 ('D_REGION_A <= 0.00', 0.05771264333350782),
 ('N_FNC_INSTLACTS <= 0.00', -0.047233060327099535),
 ('D_DEPTCARD <= 0.00', 0.046133607897402215)]

Conclusion

The model was almost certain that the correct class is "no purchase" (the client didn't buy the product). The variables with the biggest impact on this decision were:

  • N_OPEN_REV_ACTS<=0 (number of open revolving accounts is equal to zero)
  • N_FNC_INSTLACTS<=0

Note: The first variable has roughly twice the impact of the second.

The variables speaking for the class "purchase" are:

  • D_DEPTCARD<=0 (likely an indicator of holding a department-store card, cf. TOT_BAL_ALL_DPT_STORE_ACTS),
  • MRTG_1_MONTHLY_PAYMENT<=0 (the monthly mortgage payment is zero, i.e. the person has no mortgage),
  • D_REGION_A<=0 (the person doesn't live in region A).

The explanation seems reasonable: a person with no open accounts (suggesting no spare money) is unlikely to buy new products, whereas having no mortgage means fewer financial obligations, so any extra money could go towards a new purchase. Based on the LIME output, I also suppose that region A is a poorer one, since not living there makes the model decide that the person will buy a product.

OBSERVATION #9

In [39]:
explain_observation(9, model_xgb_predict)
Out[39]:
[('STUDENT_HI_CRED_RANGE <= 0.00', 0.08737011930602381),
 ('0.00 < N_OPEN_REV_ACTS <= 3.00', -0.08100359662775011),
 ('D_REGION_A <= 0.00', 0.06195845255941211),
 ('M_SNC_OLDST_MRTG_ACT_OPN > 138.96', 0.056804504607831784),
 ('D_DEPTCARD <= 0.00', 0.04420250872948519)]

Conclusion

The model had trouble classifying this observation (each class received a probability of about 0.5). The variables with the biggest impact towards the class "purchase" were:

  • STUDENT_HI_CRED_RANGE<=0 (presumably the range of the person's student credit),
  • D_REGION_A<=0 (the person doesn't live in region A),
  • M_SNC_OLDST_MRTG_ACT_OPN>138.96 (something connected to a mortgage; this observation had value 350),
  • D_DEPTCARD<=0 (probably a department-store card indicator).

The variable speaking for the class "no purchase" is:

  • 0 < N_OPEN_REV_ACTS <= 3.00 (number of open revolving accounts; this observation had value 2)

OBSERVATION #11

In [40]:
explain_observation(11, model_xgb_predict)
Out[40]:
[('N_OPEN_REV_ACTS > 7.00', 0.22242633726499192),
 ('MRTG_1_MONTHLY_PAYMENT > 414.00', -0.08092426955358029),
 ('STUDENT_HI_CRED_RANGE <= 0.00', 0.08082546928652037),
 ('D_REGION_A <= 0.00', 0.055812048968474574),
 ('M_SNC_OLDST_RETAIL_ACT_OPN > 164.24', 0.04670292441378128)]

Conclusion

The model predicted the class "purchase". The variables with the biggest impact on this decision were:

  • N_OPEN_REV_ACTS>7.00 (number of open revolving accounts, this observation had value 27)
  • STUDENT_HI_CRED_RANGE<=0 (presumably the range of the person's student credit),
  • D_REGION_A<=0 (the person doesn't live in region A)
  • M_SNC_OLDST_RETAIL_ACT_OPN>164.24 (this observation had value 209)

The variable speaking for the class "no purchase" is:

  • MRTG_1_MONTHLY_PAYMENT > 414 (monthly mortgage payment; this observation had value 1298)

The explanation seems appropriate: a person with many accounts who does not live in the poorer region A can afford a new purchase, while a high monthly mortgage payment pushes the decision towards "no purchase".

The explanations appear fairly stable across the analysed observations. N_OPEN_REV_ACTS and D_REGION_A were selected as important in all three, and MRTG_1_MONTHLY_PAYMENT in two of them:

  • N_OPEN_REV_ACTS
  • D_REGION_A
  • MRTG_1_MONTHLY_PAYMENT
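This stability claim can be checked mechanically by stripping the discretisation bounds from the LIME rules and counting feature occurrences. A sketch using the three outputs listed above (weights rounded):

```python
import re
from collections import Counter

# LIME rule lists for observations #2, #9 and #11 (from the outputs above).
explanations = [
    [('N_OPEN_REV_ACTS <= 0.00', -0.107), ('MRTG_1_MONTHLY_PAYMENT <= 0.00', 0.080),
     ('D_REGION_A <= 0.00', 0.058), ('N_FNC_INSTLACTS <= 0.00', -0.047),
     ('D_DEPTCARD <= 0.00', 0.046)],
    [('STUDENT_HI_CRED_RANGE <= 0.00', 0.087), ('0.00 < N_OPEN_REV_ACTS <= 3.00', -0.081),
     ('D_REGION_A <= 0.00', 0.062), ('M_SNC_OLDST_MRTG_ACT_OPN > 138.96', 0.057),
     ('D_DEPTCARD <= 0.00', 0.044)],
    [('N_OPEN_REV_ACTS > 7.00', 0.222), ('MRTG_1_MONTHLY_PAYMENT > 414.00', -0.081),
     ('STUDENT_HI_CRED_RANGE <= 0.00', 0.081), ('D_REGION_A <= 0.00', 0.056),
     ('M_SNC_OLDST_RETAIL_ACT_OPN > 164.24', 0.047)],
]

counts = Counter()
for exp in explanations:
    for rule, _weight in exp:
        # The feature name is the only token containing letters.
        name = next(t for t in re.split(r"[<>=\s]+", rule) if re.search(r"[A-Za-z]", t))
        counts[name] += 1

print(counts.most_common())
```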

MLP Classifier

In [41]:
from sklearn.neural_network import MLPClassifier
In [42]:
mlp=MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(50, 5), random_state=1)
mlp.fit(X_train, Y_train)
Out[42]:
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(50, 5), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
In [43]:
valid_score = mlp.score(X_test, Y_test)
print("MLP valid score {}".format(valid_score))
MLP valid score 0.7987

It seems that the model learned to always predict the label "no purchase": the validation accuracy (0.7987) almost exactly matches the share of class 0. Further hyperparameter tuning would be needed, but that is not the goal of this homework.
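This suspicion can be verified directly by inspecting the distribution of the model's hard predictions; a sketch of the check on hypothetical predictions from a degenerate classifier:

```python
import numpy as np
import pandas as pd

# Hypothetical hard predictions from a degenerate classifier: all class 0,
# against labels with a 20% positive share.
pred = np.zeros(2000, dtype=int)
y_true = np.array([1] * 400 + [0] * 1600)

# If the model always predicts the majority class, its accuracy equals the
# majority share – exactly the pattern seen with the 0.7987 score above.
dist = pd.Series(pred).value_counts(normalize=True)
accuracy = (pred == y_true).mean()
print(dist.to_dict())   # {0: 1.0}
print(accuracy)         # 0.8
```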

OBSERVATION #2

In [44]:
explain_observation(2, mlp.predict_proba)
Out[44]:
[('TOT_INSTL_HI_CRDT_CRDT_LMT <= 0.00', -0.23918417674736356),
 ('TOT_HI_CRDT_CRDT_LMT <= 0.00', 0.14027973843690275),
 ('MRTG_2_CURRENT_BAL <= 0.00', 0.11039175784922253),
 ('TOT_OTHRFIN_HICRDT_CRDTLMT <= 0.00', -0.06621082935491378),
 ('AGRGT_BAL_ALL_XCLD_MRTG <= 0.00', -0.03989323166808738)]

Conclusion

Like XGBoost, the MLPClassifier was certain that the correct class is "no purchase". The decision was based on:

  • TOT_INSTL_HI_CRDT_CRDT_LMT <=0
  • TOT_OTHRFIN_HICRDT_CRDTLMT <=0
  • AGRGT_BAL_ALL_XCLD_MRTG <=0

Whereas the following values were suggesting that the correct label is "purchase":

  • TOT_HI_CRDT_CRDT_LMT<=0
  • MRTG_2_CURRENT_BAL<=0

Note: For the same observation, the MLPClassifier relied on different variables than XGBoost to make its decision.

OBSERVATION #9

In [45]:
explain_observation(9, mlp.predict_proba)
Out[45]:
[('TOT_INSTL_HI_CRDT_CRDT_LMT <= 0.00', -0.2379900524060638),
 ('MRTG_2_CURRENT_BAL <= 0.00', 0.09524992195931001),
 ('23911.50 < TOT_HI_CRDT_CRDT_LMT <= 110366.25', -0.09177405449878827),
 ('TOT_OTHRFIN_HICRDT_CRDTLMT <= 0.00', -0.07279633859352576),
 ('AUTO_HI_CRDT_2_ACTUAL <= 0.00', 0.04574022545779912)]

Conclusion

While the XGBoost classifier was undecided between the classes for this observation, the MLPClassifier pointed to the label "no purchase". Arguments for the label "no purchase":

  • TOT_INSTL_HI_CRDT_CRDT_LMT <= 0.00
  • 23911.50 < TOT_HI_CRDT_CRDT_LMT <= 110366.25 (where the observation had value 101918)
  • TOT_OTHRFIN_HICRDT_CRDTLMT <= 0.00

Arguments for the label "purchase":

  • MRTG_2_CURRENT_BAL <= 0.00
  • AUTO_HI_CRDT_2_ACTUAL <= 0.00

The explanation seems appropriate: the information that someone doesn't have a mortgage suggests the person can afford a purchase. Again, the MLPClassifier takes into account different arguments than XGBoost.

OBSERVATION #11

In [46]:
explain_observation(11, mlp.predict_proba)
Out[46]:
[('TOT_INSTL_HI_CRDT_CRDT_LMT > 13075.50', 0.2993715198708487),
 ('TOT_HI_CRDT_CRDT_LMT > 110366.25', -0.14685244791289614),
 ('MRTG_2_CURRENT_BAL <= 0.00', 0.10538752603798048),
 ('AGRGT_BAL_ALL_XCLD_MRTG > 16330.25', 0.09576972270208714),
 ('TOT_OTHRFIN_HICRDT_CRDTLMT > 400.00', 0.08475508961837412)]

Conclusion

As noted above, the model seems to predict the class "no purchase" in every case, with a probability equal to the share of class 0. Arguments for the class "no purchase":

  • TOT_HI_CRDT_CRDT_LMT > 110366.25 (where the observation had value 225795)

Arguments for the class "purchase":

  • TOT_INSTL_HI_CRDT_CRDT_LMT > 13075.50 (where the observation had value 101000)
  • MRTG_2_CURRENT_BAL <= 0.00,
  • AGRGT_BAL_ALL_XCLD_MRTG > 16330.25 (where the observation had value 105492),
  • TOT_OTHRFIN_HICRDT_CRDTLMT > 400.00 (where the observation had value 1000)

The explanation is thought-provoking. The fact that the person doesn't have a mortgage (MRTG_2_CURRENT_BAL = 0) speaks for the decision that they will buy the product. But why should AGRGT_BAL_ALL_XCLD_MRTG > 16330.25 also indicate a purchase? This seems odd and would require further investigation.
